Curve similarity to extend dataset (failed)

Since our drive dataset is too small to build a classifier on, we need to extend the set.

How are we going to do it?

We collected a lot of time-related data, which we will not have in a real-world application of this classifier. Based on that time-related data, we want to predict whether an article is evergreen or not. This can be evaluated via the views in the days after publication.

Steps:

Getting only relevant data

The Statistical Normalization Function

Since some publishers and articles are more popular than others, and since even evergreens show a steep decrease in views over the first days, we normalize based on those first days.

This way, we can highlight the behavior in the first days and differentiate between evergreens and non-evergreens. Note that normalization functions such as softmax or symmetric normalization ruin the structure of the evergreen curves due to the high number of non-evergreen articles.

We use a normalization function that takes the first x days to calculate their average and then normalizes all days based on that average.
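A minimal sketch of this normalization, assuming the curve is a 1-D array of daily pageviews since publication (the function name and the x=4 window are assumptions, chosen to match the four-day window discussed below):

```python
import numpy as np

def normalize_by_first_days(views: np.ndarray, x: int = 4) -> np.ndarray:
    """Divide a daily-pageview curve by the mean of its first x days.

    Hypothetical helper: after normalization, the average of the first
    x days is 1.0, so curves from publishers of different sizes become
    comparable and the early drop-off stands out.
    """
    baseline = views[:x].mean()
    if baseline == 0:
        # Avoid division by zero for articles with no early views.
        return views.astype(float)
    return views / baseline

# Example: a curve that drops sharply after day four.
curve = np.array([100, 80, 60, 40, 10, 8, 6])
normalized = normalize_by_first_days(curve, x=4)
```

Here the baseline is the mean of the first four days (70), so the normalized first-four-day average is exactly 1.0 and the later days show how far the curve falls relative to its own launch.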

(Figure: pageviews over time)

It is noticeable that both non-evergreen and evergreen articles show significant popularity differences between the first four days and the days after. However, the decrease in non-evergreen pageviews is much steeper than for evergreen articles. Thus, we can consider evergreen articles more consistent in pageviews over time.

Convert into a vector

Using the full batch, we can observe that event_evs and zeitlos_evs are quite similar, as expected: ~91% similarity.

There are some articles that did not gain views over the full 80 days. Thus, we pad them with their median value.
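The vector conversion and similarity comparison above can be sketched as follows. This is an illustration under assumptions, not the project's actual code: the helper names, the 80-day length, and the use of cosine similarity as the curve-similarity measure are mine; only the median padding and the ~91% figure come from the text.

```python
import numpy as np

def to_fixed_vector(views, length: int = 80) -> np.ndarray:
    """Pad (or truncate) a pageview curve to a fixed-length vector.

    Curves shorter than `length` days are padded with their own
    median value, as described in the text. Hypothetical helper.
    """
    v = np.asarray(views, dtype=float)
    if len(v) < length:
        pad_value = np.median(v) if len(v) else 0.0
        v = np.concatenate([v, np.full(length - len(v), pad_value)])
    return v[:length]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two fixed-length curve vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example: a short curve padded to length 5 with its median (2.0).
vec = to_fixed_vector([1, 2, 3], length=5)
```

With every article mapped to the same fixed length, similarities between groups (such as event_evs and zeitlos_evs) can be computed pairwise and averaged.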

Problems

This approach would have made a great classifier, since it showed the most reliable performance and indeed the best accuracy.

However, there are still some issues to consider if we want to use it as a data extension: